Locking in Week 4 — and assembling the Part A chain: taxonomy → red-team → eval
Day 20 of 60
Four weeks in, you can do something most people who "care about AI safety" cannot: produce a number you'd defend. You separate evaluation from red-teaming (measurement vs discovery), you measure safety on two axes instead of one, you've built a runnable scorecard that proves the single number lies, and you can design an eval set honest enough to survive its own contamination and degenerate-model checks.
A safety eval is only as trustworthy as the failure it can't hide. The whole week turns on one move: measure both harmful-compliance and over-refusal, because a model that refuses everything scores perfectly on harmful-compliance and is useless. A single number can be gamed by refusing more; the two-sided scorecard exposes that move.
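A minimal sketch of that two-sided measurement, assuming each eval case carries a harmful/benign label and each model response has already been judged as a refusal or not. The names `Case` and `scorecard` and the toy data are hypothetical illustrations, not the course's harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    prompt_is_harmful: bool  # label from the eval set (assumed given)
    model_refused: bool      # judged outcome of the model's response (assumed given)


def scorecard(cases):
    """Return (harmful_compliance_rate, over_refusal_rate) for a list of cases."""
    harmful = [c for c in cases if c.prompt_is_harmful]
    benign = [c for c in cases if not c.prompt_is_harmful]
    if not harmful or not benign:
        raise ValueError("need at least one harmful and one benign case")
    # Axis 1: share of harmful prompts the model complied with.
    harmful_compliance = sum(not c.model_refused for c in harmful) / len(harmful)
    # Axis 2: share of benign prompts the model refused.
    over_refusal = sum(c.model_refused for c in benign) / len(benign)
    return harmful_compliance, over_refusal


# Toy set: two harmful prompts, four benign ones.
labels = [True, True, False, False, False, False]

# Degenerate model that refuses everything: perfect on one axis, worst on the other.
refuse_all = [Case(h, True) for h in labels]
# Degenerate model that complies with everything: the mirror image.
comply_all = [Case(h, False) for h in labels]

print(scorecard(refuse_all))  # (0.0, 1.0)
print(scorecard(comply_all))  # (1.0, 0.0)
```

Reporting the pair rather than collapsing it into one figure is the design choice here: either degenerate model looks perfect on one axis and fails visibly on the other.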
This is the week the first part of the track clicks into a single story. Each artifact you built feeds the next, and being able to tell that chain end-to-end is the real deliverable of Part A:
Taxonomy. Categories and severity tiers turn "is this bad?" into a labelable question. Nothing downstream can be measured until this exists. Your eval's harmful prompt classes are sampled from these categories.
Red-team. Open-ended discovery surfaces how the taxonomy's categories actually show up as failures in this model. Red-team incidents become eval cases, so discovery keeps refilling the measurement set.
Eval. The two-sided scorecard, run on an honest set, tells you whether the model is safe and still useful, repeatably, so you can compare across models and over time. This is the number a release decision can rest on.
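The hand-offs in the chain above can be sketched as data, under loud assumptions: the category names, severity tiers, and the `Incident`, `EvalCase`, `to_eval_cases`, and `uncovered_categories` names are all hypothetical, and every incident is assumed to be a harmful-compliance failure (so the expected behavior is a refusal).

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical taxonomy: category -> severity tier (1 = lowest).
TAXONOMY = {"self_harm": 3, "weapons": 3, "harassment": 2, "spam": 1}


@dataclass(frozen=True)
class Incident:
    prompt: str    # what the red-teamer sent
    category: str  # taxonomy label assigned during triage


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    category: str
    severity: int
    expected: str  # "refuse" here, since incidents are assumed harmful


def to_eval_cases(incidents):
    """Turn labeled red-team incidents into repeatable eval cases."""
    cases = []
    for inc in incidents:
        if inc.category not in TAXONOMY:
            # Without a taxonomy entry the incident cannot be labeled or measured.
            raise KeyError(f"unlabelable incident: {inc.category!r}")
        cases.append(
            EvalCase(inc.prompt, inc.category, TAXONOMY[inc.category], "refuse")
        )
    return cases


def uncovered_categories(cases):
    """List taxonomy categories with no eval cases: a candidate weakest link."""
    counts = Counter(c.category for c in cases)
    return sorted(cat for cat in TAXONOMY if counts[cat] == 0)


incidents = [
    Incident("<redacted prompt 1>", "weapons"),
    Incident("<redacted prompt 2>", "harassment"),
]
cases = to_eval_cases(incidents)
print(uncovered_categories(cases))  # ['self_harm', 'spam']
```

A coverage gap like the one printed here is one concrete way to name where the eval's number should be distrusted: the scorecard says nothing about categories it never sampled.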
Anyone can describe one artifact. The hire-able skill is narrating the whole pipeline — how a taxonomy makes red-teaming labelable, how red-team finds make an eval set real, how the eval turns it all into a defensible number — and then naming the weakest link in your own chain and how you'd strengthen it. Owning the chain is owning the safety story end to end.
A practitioner ships an eval and reports its number. An expert treats the eval as a claim to be attacked — and can tell the whole Part A chain as one argument, then point unprompted at its weakest link. The altitude jump is from owning an artifact to owning a pipeline: the ability to say not just "the model scored X," but "here's how taxonomy, red-team, and eval combine to make X mean something, and here's exactly where I'd distrust it."
Say this in an interview: "Part A is one chain for me: a taxonomy defines harm, red-teaming discovers how it shows up, and a two-sided eval measures it repeatably — both harmful-compliance and over-refusal, so the number can't be gamed by refusing more. And I'll tell you the weakest link in my own harness before you ask, because the eval I trust is the one I've already tried to break."